fix(spanner): fix grpc-gcp affinity cleanup and multiplexed channel usage leaks #12726
Code Review
This pull request implements explicit cleanup for gRPC-GCP channel affinity and introduces reference counting for shared channel usage in MultiplexedSessionDatabaseClient to ensure proper resource management. Key changes include the addition of clearTransactionAndChannelAffinity to the SpannerRpc interface and its implementations, allowing the system to unbind affinity keys when transactions or single-use operations complete. Additionally, GapicSpannerRpc was updated to propagate affinity cleanup settings even when dynamic channel pooling is disabled. Review feedback suggests adding a null check in getGrpcGcpChannelPoolOptions to prevent a potential NullPointerException and recommends consolidating the clearChannelHintAffinity implementation to avoid code duplication between KeyAwareChannel and GapicSpannerRpc.
GcpChannelPoolOptions channelPoolOptions = options.getGcpChannelPoolOptions();
if (options.isDynamicChannelPoolEnabled()) {
  return channelPoolOptions;
}

return GcpChannelPoolOptions.newBuilder()
    .disableDynamicScaling()
    .setAffinityKeyLifetime(channelPoolOptions.getAffinityKeyLifetime())
    .setCleanupInterval(channelPoolOptions.getCleanupInterval())
    .build();
The method getGrpcGcpChannelPoolOptions assumes that options.getGcpChannelPoolOptions() returns a non-null value. If it returns null, a NullPointerException will occur when calling getAffinityKeyLifetime() or getCleanupInterval(). While SpannerOptions typically provides a default, adding a null check or ensuring a non-null value is safer for robustness.
static GcpChannelPoolOptions getGrpcGcpChannelPoolOptions(SpannerOptions options) {
  GcpChannelPoolOptions channelPoolOptions = options.getGcpChannelPoolOptions();
  if (options.isDynamicChannelPoolEnabled()) {
    return channelPoolOptions;
  }
  GcpChannelPoolOptions.Builder builder = GcpChannelPoolOptions.newBuilder().disableDynamicScaling();
  if (channelPoolOptions != null) {
    builder.setAffinityKeyLifetime(channelPoolOptions.getAffinityKeyLifetime())
        .setCleanupInterval(channelPoolOptions.getCleanupInterval());
  }
  return builder.build();
}
gcpChannelPoolOptions is always initialized to a non-null value (the merged user options). The builder setter also rejects null, so adding a defensive null branch would only hide invariant breaks.
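The argument in this reply can be illustrated with a minimal sketch. The class below is hypothetical and is not SpannerOptions; it only assumes the two properties the reply states: the field starts out non-null, and the setter rejects null. Under those assumptions a getter can never return null, so a null branch at the call site would be dead code.

```java
import java.util.Objects;

/** Hypothetical options holder used only to illustrate the invariant; not SpannerOptions. */
final class PoolSettingsHolder {
  private final String poolSettings;

  private PoolSettingsHolder(String poolSettings) {
    this.poolSettings = poolSettings;
  }

  /** Never returns null: the builder starts with a default and rejects null overrides. */
  String getPoolSettings() {
    return poolSettings;
  }

  static final class Builder {
    // Assumption for this sketch: a non-null default is always present.
    private String poolSettings = "defaults";

    Builder setPoolSettings(String poolSettings) {
      // Fail fast at the point where the invariant would be broken.
      this.poolSettings = Objects.requireNonNull(poolSettings, "poolSettings");
      return this;
    }

    PoolSettingsHolder build() {
      return new PoolSettingsHolder(poolSettings);
    }
  }
}
```

With this shape, a broken invariant surfaces as an exception in the setter rather than being silently papered over later.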
# Conflicts: # java-spanner/google-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/GapicSpannerRpcTest.java
3599670 to ce64c9e
synchronized (CHANNEL_USAGE) {
  CHANNEL_USAGE.putIfAbsent(sessionClient.getSpanner(), new BitSet(numChannels));
  this.channelUsage = CHANNEL_USAGE.get(sessionClient.getSpanner());
  SharedChannelUsage sharedChannelUsage = CHANNEL_USAGE.get(this.spanner);
I think that we should seriously consider removing this manual channel distribution logic from this class. It was introduced when using the Gax channel pool to prevent low-QPS traffic from sticking to just one channel. My understanding is that grpc-gcp does that automatically, as it falls back to a round-robin scheme when load is low. Removing it would significantly simplify this class without defeating the purpose it was intended to serve.
Customers still have the option to disable grpc-gcp and switch to GAX, so this cleanup should be done later.
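For readers following this thread, the reference-counting idea behind the shared channel usage state can be sketched roughly as below. This is not the PR's SharedChannelUsage class; the registry name, its methods, and the generic key type are all assumptions made for illustration. The sketch only shows the general technique: each client acquires the shared state for its owner, and the entry is dropped when the last holder releases it.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry; names are illustrative and not taken from the PR. */
final class SharedUsageRegistry<K> {
  /** Shared per-owner state plus the number of clients currently using it. */
  private static final class Entry {
    final BitSet usage;
    int refs;

    Entry(int numChannels) {
      this.usage = new BitSet(numChannels);
    }
  }

  private final Map<K, Entry> entries = new HashMap<>();

  /** Returns the shared state for the owner, creating it on first use. */
  synchronized BitSet acquire(K owner, int numChannels) {
    Entry entry = entries.computeIfAbsent(owner, k -> new Entry(numChannels));
    entry.refs++;
    return entry.usage;
  }

  /** Drops one reference; the entry is removed when the last holder releases it. */
  synchronized void release(K owner) {
    Entry entry = entries.get(owner);
    if (entry != null && --entry.refs == 0) {
      entries.remove(owner);
    }
  }

  synchronized int size() {
    return entries.size();
  }
}
```

The point of the design is that the map no longer holds a strong reference to an owner once every client created for it has been closed.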
// read-only transactions tend to keep picking the same idle channel, so keep reads
// overlapping to verify distribution across the fixed-size pool.
mockSpanner.setExecuteStreamingSqlExecutionTime(
    SimulatedExecutionTime.ofMinimumAndRandomTime(500, 0));
Would it be possible to just freeze the mock server, wait for the server to contain N requests, and then unfreeze it, instead of adding 500ms execution time for each query? I think that would achieve the same, but with less execution time. (There should be a util method in the mock server for 'waitForRequests' or something like that)
I tried this approach, but the current mock server freeze() is global and blocks earlier RPCs like session creation/transaction setup before enough ExecuteStreamingSql requests are enqueued. That made the test deadlock/time out. I kept the per-query streaming delay for now because it is deterministic and keeps the overlap in the specific RPC path we want to exercise.
If we want to switch to freeze/wait/unfreeze, I think we first need a mock-server utility that can either freeze only ExecuteStreamingSql or wait for an exact request count without globally blocking unrelated RPCs.
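A utility of the kind described above could look roughly like the following. This is a sketch only: RequestGate is a hypothetical name, it is not part of the mock server, and how it would be wired into the ExecuteStreamingSql handler is left open. It shows one way to park requests for a single RPC until an exact count has arrived, without blocking unrelated RPCs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-RPC gate; not an existing mock-server API. */
final class RequestGate {
  private final CountDownLatch arrived;
  private final CountDownLatch released = new CountDownLatch(1);

  RequestGate(int expectedRequests) {
    this.arrived = new CountDownLatch(expectedRequests);
  }

  /** Called only by the handler of the gated RPC: records arrival, then blocks until release(). */
  void enter() throws InterruptedException {
    arrived.countDown();
    released.await();
  }

  /** Called by the test: returns true once the expected number of requests are parked. */
  boolean awaitArrivals(long timeout, TimeUnit unit) throws InterruptedException {
    return arrived.await(timeout, unit);
  }

  /** Called by the test: lets all parked requests continue. */
  void release() {
    released.countDown();
  }
}
```

Because only the gated handler calls enter(), session creation and transaction setup would proceed normally while the streaming requests pile up.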
You can work around that by making sure the CreateSession RPC has finished before you start the actual test. There are at least two ways to achieve that:
- Execute another query before freezing the mock server.
- Use SessionPoolOptions#setWaitForMinSessions(..) to make sure that the creation of the session becomes a blocking operation.
I tried both workarounds:
- prewarming with an extra query before freeze()
- setting SessionPoolOptions#setWaitForMinSessionsDuration(...)
In this test path that still times out before the expected number of ExecuteStreamingSql requests reach the mock server, so the freeze/wait/unfreeze variant is still not stable here.
@olavloite This was added in the commit https://github.com/googleapis/google-cloud-java/pull/12715/changes#diff-abde1c3268c5f4417d5742c6c8a5541c6cf3e7d05bfc0be134e044891f9dd181, which is causing the GitHub check failure.
Summary
This change fixes two regressions in the Spanner Java client introduced around multiplexed session initialization and
grpc-gcp affinity handling.
1. Fix static CHANNEL_USAGE leak in MultiplexedSessionDatabaseClient
Fixes: #12693
MultiplexedSessionDatabaseClient stored per-SpannerImpl channel usage state in a static map but never removed
entries on close. After multiplexed client creation became unconditional in SpannerImpl.getDatabaseClient(),
applications that repeatedly created and closed Spanner instances could retain closed SpannerImpl objects, gRPC
channels, and related transport state indefinitely.
This change:
- [...] Map<SpannerImpl, BitSet> with reference-counted shared state
- [...] MultiplexedSessionDatabaseClient for a given SpannerImpl closes
- [...] SpannerImpl
2. Stop using bitset / bounded % numChannels affinity for grpc-gcp
For grpc-gcp-enabled paths, channel affinity should use the raw random channel hint and rely on explicit unbind /
cleanup, rather than:
- [...] % numChannels
This change:
- [...] GapicSpannerRpc
- [...] % numChannels mapping from the grpc-gcp call path
3. Add explicit grpc-gcp affinity cleanup for multi-use read-only transaction close
Multi-use read-only transactions reuse a single random channel hint for the lifetime of the transaction. That hint
should remain stable across all reads in the transaction, then be explicitly cleaned up when the transaction closes.
This change:
- [...] close()
- [...] ExecuteSql RPC to Spanner
- [...] KeyAwareChannel, using the routed endpoint associated with the transaction
4. Apply grpc-gcp affinity cleanup settings without implicitly enabling DCP
grpc-gcp affinity cleanup settings (affinityKeyLifetime, cleanupInterval) should be applied whether or not dynamic
channel pool (DCP) is enabled.
This change:
- [...] affinityKeyLifetime
- [...] cleanupInterval
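The affinity lifecycle described in sections 2 and 3 (a raw random hint that stays stable for the whole transaction and is explicitly unbound on close) can be sketched as follows. Both classes are hypothetical stand-ins: AffinityTable is not the grpc-gcp channel pool and ReadOnlyTx is not the Spanner client's transaction class. The round-robin binding is an assumption chosen to keep the example small.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical stand-in for a pool that binds affinity keys to channels; not the grpc-gcp API. */
final class AffinityTable {
  private final Map<Long, Integer> keyToChannel = new HashMap<>();
  private final int numChannels;
  private int next; // round-robin cursor used when a key is first bound

  AffinityTable(int numChannels) {
    this.numChannels = numChannels;
  }

  /** Binds the raw key on first use and returns the same channel index afterwards. */
  synchronized int channelFor(long key) {
    Integer bound = keyToChannel.get(key);
    if (bound == null) {
      bound = next;
      next = (next + 1) % numChannels;
      keyToChannel.put(key, bound);
    }
    return bound;
  }

  /** Explicit cleanup: forget the key so the table does not grow without bound. */
  synchronized void unbind(long key) {
    keyToChannel.remove(key);
  }

  synchronized int boundKeys() {
    return keyToChannel.size();
  }
}

/** Hypothetical multi-use read-only transaction that keeps one hint for its lifetime. */
final class ReadOnlyTx implements AutoCloseable {
  private final AffinityTable table;
  private final long hint = ThreadLocalRandom.current().nextLong();

  ReadOnlyTx(AffinityTable table) {
    this.table = table;
  }

  /** Every read in the transaction resolves to the same channel. */
  int read() {
    return table.channelFor(hint);
  }

  @Override
  public void close() {
    table.unbind(hint);
  }
}
```

Used with try-with-resources, the binding exists only while the transaction is open, which is the behavior the description asks for without relying solely on time-based expiry.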